{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BIDMach: basic classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this tutorial, we'll BIDMach's GLM (Generalized Linear Model) package. It includes linear regression, logistic regression, and support vector machines (SVMs). The imports below include both BIDMat's matrix classes, and BIDMach machine learning classes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import $exec.^.lib.bidmach_notebook_init\n",
    "if (Mat.hasCUDA > 0) GPUmem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset: Reuters RCV1 V2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset is the widely used Reuters news article dataset RCV1 V2. This dataset and several others are loaded by running the script <code>getdata.sh</code> from the BIDMach/scripts directory. The data include both train and test subsets, and train and test labels (cats). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "var dir = \"../data/rcv1/\"  // Assumes bidmach is run from BIDMach/tutorials. Adjust to point to the BIDMach/data/rcv1 directory\n",
    "tic\n",
    "val train = loadSMat(dir+\"docs.smat.lz4\")\n",
    "val cats = loadFMat(dir+\"cats.fmat.lz4\")\n",
    "val test = loadSMat(dir+\"testdocs.smat.lz4\")\n",
    "val tcats = loadFMat(dir+\"testcats.fmat.lz4\")\n",
    "toc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "BIDMach's basic classifiers can invoked like this on data that fits in memory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val (mm, opts) = GLM.learner(train, cats, GLM.logistic)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last option specifies the type of model, linear, logistic or SVM. The syntax is a little unusual. There are two values returned. The first <code>mm</code> is a \"learner\" which includes model, optimizer, and mixin classes. The second <code>opts</code> is an options object specialized to that combination of learner components. This design facilitates rapid iteration over model parameters from the command line or notebook. \n",
    "\n",
    "The parameters of the model can be viewed and modified by doing <code>opts.what</code>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "opts.what\n",
    "opts.lrate=0.3f"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most of these will work well with their default values. On the other hand, a few have a strong effect on performance. Those include:\n",
    "<pre>\n",
    "lrate: the learning rate\n",
    "batchSize: the minibatch size\n",
    "npasses: the number of passes over the dataset\n",
    "</pre>\n",
    "We will talk about tuning those in a moment. For now lets train the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "opts.npasses=2\n",
    "mm.train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output includes important information about the training cycle:\n",
    "* Percentage of dataset processed\n",
    "* Cross-validated log likelihood (or negative loss)\n",
    "* Overall throughput in gigaflops\n",
    "* Elapsed time in seconds\n",
    "* Total Gigabytes processed\n",
    "* I/O throughput in MB/s\n",
    "* GPU memory remaining (if using a GPU)\n",
    "\n",
    "The likelihood is calculated on a set of minibatches that are held out from training on every cycle. So this is a cross-validated likelihood estimate. Cross-validated likelihood will increase initially, but will then flatten and may decrease. There is random variation in the likelihood estimates because we are using SGD. Determining the best point to stop is tricky to do automatically, and is instead left to the analyst. \n",
    "\n",
    "To evaluate the model, we build a classifier from it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val (pp, popts) = GLM.predictor(mm.model, test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And invoke the predict method on the predictor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pp.predict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although ll values are printed above, they are not meaningful (there is no target to compare the prediction with). \n",
    "\n",
    "We can now compare the accuracy of predictions (preds matrix) with ground truth (the tcats matrix). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val preds = FMat(pp.preds(0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val lls = mean(ln(1e-7f + tcats ∘ preds + (1-tcats) ∘ (1-preds)),2)  // actual logistic likelihood\n",
    "mean(lls)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A more thorough measure is ROC area:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val rocs = roc2(preds, tcats, 1-tcats, 100)   // Compute ROC curves for all categories\n",
    "plot(rocs(?,6))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot(rocs(?,0->5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val aucs = mean(rocs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aucs(6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could go ahead and start tuning our model, or we automate the process as in the next worksheet."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": "text/x-scala",
   "file_extension": ".scala",
   "mimetype": "text/x-scala",
   "name": "scala211",
   "nbconvert_exporter": "script",
   "pygments_lexer": "scala",
   "version": "2.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
